Word Co-occurrence Counts Prediction for Bilingual Terminology Extraction from Comparable Corpora
Authors
Abstract
Methods for bilingual lexicon extraction from comparable corpora are often based on observed word co-occurrences and are inherently more effective with large corpora. In most cases, however, specialized comparable corpora are small, which directly affects bilingual terminology extraction results. To overcome insufficient data coverage and make word co-occurrence statistics more reliable, we propose building a predictive model of word co-occurrence counts. We compare different prediction models with the traditional Standard Approach (Fung, 1998) and show that, once the best procedures have been identified, our method significantly improves the performance of extracting word translations from comparable corpora.
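As a rough illustration of the statistic the abstract refers to, the sketch below counts word co-occurrences inside a sliding window. It is not the paper's predictive model: the function name `cooccurrence_counts`, the window size, and the toy sentence are all assumptions made for this example.

```python
from collections import Counter


def cooccurrence_counts(tokens, window=2):
    """Count ordered (word, context word) pairs that occur within `window` tokens.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    counts = Counter()
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(word, tokens[j])] += 1
    return counts


# Toy corpus (an assumption): a small specialized corpus yields sparse counts.
tokens = "breast cancer screening detects breast cancer early".split()
counts = cooccurrence_counts(tokens, window=2)

# Observed pairs get small counts; many plausible pairs are never observed.
seen = counts[("breast", "cancer")]
unseen = counts[("screening", "early")]
```

In a corpus this small, most word pairs are never observed together, so their count is zero. That sparsity is the problem the abstract proposes to address by predicting co-occurrence counts rather than relying only on observed ones.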
Similar resources
Bilingual Word Embeddings for Bilingual Terminology Extraction from Specialized Comparable Corpora
Bilingual lexicon extraction from comparable corpora is constrained by the small amount of available data when dealing with specialized domains. This aspect penalizes the performance of distributional-based approaches, which is closely related to the reliability of word co-occurrence counts extracted from comparable corpora. A solution to avoid this limitation is to associate external resources...
Efficient Data Selection for Bilingual Terminology Extraction from Comparable Corpora
Comparable corpora are the main alternative to parallel corpora for extracting bilingual lexicons. Although comparable corpora are easier to build, specialized comparable corpora are often of modest size compared with corpora drawn from the general domain. Consequently, the observations of word co-occurrences, which are the basis of context-based methods, are unreliable. We propose...
Looking at Unbalanced Specialized Comparable Corpora for Bilingual Lexicon Extraction
The main work in bilingual lexicon extraction from comparable corpora is based on the implicit hypothesis that corpora are balanced. However, the historical context-based projection method dedicated to this task is relatively insensitive to the sizes of each part of the comparable corpus. Within this context, we have carried out a study on the influence of unbalanced specialized comparable corpo...
Improving Bilingual Terminology Extraction from Comparable Corpora via Multiple Word-Space Models
There is a rich flora of word space models that have proven their efficiency in many different applications including information retrieval (Dumais et al., 1988), word sense disambiguation (Schütze, 1993), various semantic knowledge tests (Lund et al., 1995; Karlgren and Sahlgren, 2001), and text categorization (Sahlgren and Karlgren, 2005). Based on the assumption that each model captures some...
French-English Terminology Extraction from Comparable Corpora
This article presents a method of extracting bilingual lexica composed of single-word terms (SWTs) and multi-word terms (MWTs) from comparable corpora of a technical domain. First, this method extracts MWTs in each language, and then uses statistical methods to align single words and MWTs by exploiting the term contexts. After explaining the difficulties involved in aligning MWTs and specifying...